Optimal Primary-Backup Protocols 


Navin Budhiraja* 

Keith Marzullo* 

Fred B. Schneider** , 

Sam Toueg***

TR 92-1299 b 

August 1992



Department of Computer Science 
Cornell University 
Ithaca, NY 14853-7501 


*Supported by Defense Advanced Research Projects Agency (DoD) under NASA
Ames grant number NAG 2-593 and by grants from IBM and Siemens.

**Supported in part by the Office of Naval Research under contract
N00014-91-J-1219, the National Science Foundation under Grant No. CCR-8701103,
DARPA/NSF Grant No. CCR-9014363 and by a grant from IBM Endicott Programming
Laboratory.

***Supported in part by NSF grants CCR-8901780 and CCR-9102231 and by a grant
from IBM Endicott Programming Laboratory.



Optimal Primary-Backup Protocols


Navin Budhiraja*, Keith Marzullo*, Fred B. Schneider**, Sam Toueg*** 
Department of Computer Science, Cornell University, Ithaca NY 14853, USA 


Abstract. We give primary-backup protocols for various models of failure.
These protocols are optimal with respect to degree of replication,
failover time, and response time to client requests.


1 Introduction 

One way to implement a fault-tolerant service is to employ multiple sites that
fail independently. The state of the service is replicated and distributed among 
these sites, and updates are coordinated so that even when a subset of the sites 
fail, the service remains available. 

A common approach to structuring such replicated services is to designate 
one site as the primary and all the others as backups. Clients make requests by 
sending messages only to the primary. If the primary fails, then a failover occurs
and one of the backups takes over. This service architecture is commonly called 
the primary-backup or the primary-copy approach [1]. 

In [5] we give lower bounds for implementing primary-backup protocols under 
various models of failure. These lower bounds constrain the degree of replication, 
the time during which the service can be without a primary, and the amount of 
time it can take to respond to a client request. In this paper, we show that most 
of these lower bounds are tight by giving matching protocols. 

Some of the protocols that we describe have surprising properties. In one 
case, the optimal protocol is one in which a non-faulty primary is forced to 
relinquish control to a backup that it knows to be faulty! However, the existence 
of such a scenario is not peculiar to our protocol. As shown in [5], relinquishing 
control to a faulty backup is indeed necessary to achieve optimal protocols in 
some failure models. Another surprise is that in some protocols that achieve 
optimal response time, the site that receives the request (i.e., the primary) is
not the site that sends the response to the clients. We show that this anomaly is 
not idiosyncratic to our protocols — it is necessary for achieving optimal response 
time.

The rest of the paper is organized as follows. Section 2 gives a specification for
primary-backup protocols, Sect. 3 discusses our system model, Sect. 4 summarizes
the lower bounds from [5], and Sect. 5 summarizes our results. Sections 6,
7 and 8 describe the protocols that achieve our lower bounds, and Sect. 9
describes a protocol in which the primary is forced to relinquish control to a
faulty backup. We conclude in Sect. 10. Due to lack of space, the description of
some of the protocols and all proofs are omitted from this paper. See [4] for a
complete description and proofs.


2 Specification of Primary-Backup Services 

Our results apply to any protocol that satisfies the following four properties, 
and many primary-backup protocols in the literature (e.g. [1,2,3]) do satisfy 
this characterization. 

Pb1: There exists a predicate Prmy_s on the state of each site s. At any time,
there is at most one site s whose state satisfies Prmy_s.

Pb2: Each client i maintains a site identity Dest_i such that to make a request,
client i sends a message (only) to Dest_i.

For the next property, we model a communications network by assuming that 
client requests are enqueued in a message queue of a site. 

Pb3: If a client request arrives at a site that is not the primary, then that request 
is not enqueued (and is therefore not processed by the site). 

A request sent to a primary-backup service can be lost if it is sent to a faulty 
primary. Periods during which requests are lost, however, are bounded by the 
time required for a backup to take over as the new primary. Such behavior is 
an instance of what we call bofo (bounded outage finitely often). We say that 
an outage occurs at time t if some client makes a request at that time but does
not receive a response.^1 A (k, Δ)-bofo server is one for which all outages can
be grouped into at most k periods, each period having duration of at most Δ.^2
The final property of the primary-backup protocols is that they implement a
bofo server (for some values of k and Δ).

Pb4: There exist fixed and bounded values k and Δ such that the service behaves
like a single (k, Δ)-bofo server.

Clearly, Pb4 cannot be implemented if the number of failures is not bounded.
In particular, if all sites fail, then no service can be provided and so the service
is not (k, Δ)-bofo for any finite k and Δ.
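
The (k, Δ)-bofo condition can be checked mechanically on a trace of outage
instants. The following Python sketch is only an illustration and not part of
the protocols in this paper: the function name `is_bofo` and the representation
of outages as a list of numeric times are assumptions. It greedily groups
outages into periods of length at most Δ, which yields the smallest possible
number of periods.

```python
def is_bofo(outage_times, k, delta):
    """Return True if all outage instants fit into at most k periods,
    each of duration at most delta (hypothetical helper, not from the paper)."""
    periods = 0
    period_start = None
    for t in sorted(outage_times):
        # Open a new period when t no longer fits in the current one.
        if period_start is None or t - period_start > delta:
            periods += 1
            period_start = t
    return periods <= k
```

For example, outages at times 1, 2 and 10 form two periods when Δ = 1, so such a
trace is consistent with a (2, 1)-bofo server but not with a (1, 1)-bofo server.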

1 For simplicity, we assume in this paper that every request elicits a response.

2 Therefore, as well as being finite, the number of periods during which service
outages can occur is also bounded (by k).


3 The Model 


Consider a system with n_s sites and n_c clients. Site clocks are assumed to be
perfectly synchronized with real time.^3 Clients and sites communicate through a
completely connected, point-to-point, FIFO network. Furthermore, if processes
(clients or sites) p_i and p_j are connected by a (non-faulty) link, then we
assume that, for some a priori known δ, a message sent by p_i to p_j at time t
arrives at p_j at some time t' ∈ (t..t + δ].

We assume that all clients are non-faulty and consider the following types
of site and link failures: crash failures (faulty sites may halt prematurely; until
they halt, they behave correctly)^4, crash+link failures (faulty sites may crash or
faulty links may lose messages), receive-omission failures (faulty sites may crash
or omit to receive some messages), send-omission failures (faulty sites may crash
or omit to send some messages), general-omission failures (faulty sites may fail
by send-omission, receive-omission, or both). Note that link failures and the
various types of omission failures are different only insofar as a message loss is
attributed to a different component. Link failures are masked by adding redundant
communication paths; omission failures are masked by adding redundant
sites. As we will see, the lower bounds for the two cases are different.

Let f be the maximum number of components that can be faulty (i.e., f is
the maximum number of faulty sites in the case of crash, send-omission,
receive-omission and general-omission failures, whereas f is the maximum number
of faulty sites and links in the case of crash+link failures).


4 Lower Bounds 


In Tab. 1, we repeat the lower bounds from [5] for the degree of replication, the
blocking time and the failover time for the various kinds of failures. Informally,
a protocol is C-blocking if in all failure-free runs, the time that elapses from
the moment a site receives a request until a site sends the associated response
is bounded by C.^5 Failover time is defined to be the longest duration (over all
possible runs) for which there is no primary. However, the failover time bounds
only hold for protocols that satisfy the following additional (and reasonable)
property.

Pb5: A correct site that is the primary remains so until there is a failure. 


3 The protocols can be extended to the more realistic model in which clocks are only
approximately synchronized [7].

4 The lower bounds are also tight for fail-stop failures [10] except for the bound on 
failover time. 

5 We assume that it takes no time for a site to compute the response to a request. 


Table 1. Lower Bounds: Degree of Replication, Blocking Time and Failover Time

Failure type       Replication     Blocking time (C)            Failover time
-----------------  --------------  ---------------------------  -------------
Crash              n_s > f         0                            fδ
Crash+Link         n_s > f + 1     0                            2fδ
Send-Omission      n_s > f         δ if f = 1                   2fδ
                                   2δ if f > 1
Receive-Omission   n_s > ⌊3f/2⌋    δ if n_s ≤ 2f and f = 1      2fδ
                                   2δ if n_s ≤ 2f and f > 1
                                   0 if n_s > 2f
General-Omission   n_s > 2f        δ if f = 1                   2fδ
                                   2δ if f > 1
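
As a reading aid, the replication and blocking-time columns of Tab. 1 can be
written as small functions. This Python sketch is illustrative only: the names
`min_sites` and `blocking_lower_bound` and the string keys for the failure
models are assumptions introduced here, and the functions encode nothing beyond
the table entries.

```python
def min_sites(failure_model, f):
    """Smallest n_s allowed by the replication column of Tab. 1."""
    bounds = {
        "crash": f + 1,                        # n_s > f
        "crash+link": f + 2,                   # n_s > f + 1
        "send-omission": f + 1,                # n_s > f
        "receive-omission": (3 * f) // 2 + 1,  # n_s > floor(3f/2)
        "general-omission": 2 * f + 1,         # n_s > 2f
    }
    return bounds[failure_model]


def blocking_lower_bound(failure_model, f, n_s, delta):
    """Lower bound on the blocking time C from Tab. 1, in the units of delta."""
    if failure_model in ("crash", "crash+link"):
        return 0
    if failure_model == "receive-omission" and n_s > 2 * f:
        return 0
    return delta if f == 1 else 2 * delta
```

For instance, `min_sites("receive-omission", 1)` evaluates to 2, which matches
the two-site protocol of Sect. 9.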


5 Summary of Results 


We first present a primary-backup protocol schema that will be used to derive
the protocols for all the failure models. This schema is based on the properties of
two key primitives, broadcast and deliver, that sites use to exchange messages.
We show that the schema satisfies Pb1-Pb5 using only these properties,
independent of the particular failure model. Each failure model (crash,
crash+link, send-omission, receive-omission and general-omission) is handled
with a different implementation of broadcast and deliver, and in all but one
case optimal protocols are constructed.

The protocols for crash and crash+link failures show that all the corresponding
lower bounds are tight. The protocol for general-omission failures uses a
translation technique similar to [8], and demonstrates that our lower bounds for
general-omission failures are tight, except for the bound on blocking time when
f = 1. However, for this special case we have derived a different protocol (not
described in this paper) having optimal blocking time. In all failure-free runs of
this protocol, the site that receives the request (i.e., the primary) is not the
site that sends the response to the client. In this paper, we show that this
behavior is necessary.

We do not show the protocols for send-omission and receive-omission failures
in this paper because they are similar to the protocol for general-omission
failures. These protocols establish that the bounds for send-omission failures
are tight. For receive-omission failures, the lower bound on blocking time when
n_s > 2f and the lower bound on failover time are also tight. However, our
protocol does not have optimal replication, as it requires n_s > 2f (rather than
n_s > ⌊3f/2⌋).

Finally, in [5] we proved that all receive-omission protocols having
⌊3f/2⌋ < n_s ≤ 2f necessarily exhibit a scenario in which a non-faulty primary is
forced to relinquish control to a faulty backup. In Sect. 9, we describe such a
protocol: it uses two sites and tolerates a single receive-omission failure. In
addition, this protocol is δ-blocking and so it demonstrates that our lower bound
on blocking time is tight for n_s ≤ 2f and f = 1. As in the protocol for general
omission when f = 1, it is the backup that sends responses to clients. This
behavior is shown to be necessary for an important class of protocols.

6 Protocols for the Clients and the (k, Δ)-bofo Server

Property Pb4 requires that the primary-backup service behave like some (k, Δ)-
bofo server. Figure 1 gives such a canonical (k, Δ)-bofo server (say s), and Fig. 2
gives the protocol for client i interacting with s. As with any other bofo server,
a client will not receive the response to a request if either the request to s or the
response from s is lost.


initialize()
cobegin
|| inform-clients(“Dest = s”)
|| do forever
      when received request from client c
         response := Π(state, request)
         state := state ∘ response
         send response to client c
   od
coend

procedure initialize()
   state := ε

procedure inform-clients(ic)
   send (ic) to all clients

Fig. 1. Protocol run by a single (k, Δ)-bofo server s


In Fig. 2, response-time corresponds to the amount of time the client has to
wait in order to get the response from s, which is just the round trip message
delay. The exact value for response-time depends on the failure model being
assumed.
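
The request loop of the canonical server in Fig. 1 can be mirrored in a few
lines of Python. This is a minimal sketch under stated assumptions: the class
name `BofoServer` is hypothetical, the state is modelled as a list of responses
(the empty list standing for the empty initial state), appending stands for the
∘ operator, and the service function Π is passed in as an ordinary callable.
Message transport and inform-clients are left out.

```python
class BofoServer:
    """Sketch of the request loop of Fig. 1 (transport omitted)."""

    def __init__(self, pi):
        self.pi = pi      # assumed service function: (state, request) -> response
        self.state = []   # empty initial state

    def handle(self, request):
        response = self.pi(self.state, request)
        self.state = self.state + [response]   # state := state o response
        return response                        # would be sent to client c
```

Each call to `handle` corresponds to one iteration of the do-forever loop.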

7 The Primary-Backup Protocol Schema 

We first make the simplifying assumption that the links between the clients and
the sites are non-faulty and there are no omission failures between the clients and
the sites (i.e., only the links between sites can be faulty for crash+link failures,
and omission failures can occur only between sites for omission failures). We
show in Sect. 7.1 how this assumption can be removed.

cobegin
|| do forever
      if received “Dest = s” then
         Dest_i := s
   od
|| do forever
      if want to send request
         send request to Dest_i
         if not received response by response-time then
            recover() /* call some recovery procedure, which might retry */
         else
   od
coend

Fig. 2. Protocol run by client i interacting with server s

In order to emulate the server s (and consequently satisfy property Pb4), our
primary-backup protocol consists of n_s sites {s_0, ..., s_{n_s-1}}, each of which
runs the protocol in Fig. 3. The protocol for the clients remains the same.


initialize(i)
cobegin
|| if i = 0 then primary(i) else backup(i)
|| delivery-process(i)
|| failure-detector(i)
coend

Fig. 3. Protocol run by site s_i to emulate server s


The procedures primary and backup (shown in Fig. 4) are the same for all
the failure models. On the other hand, the implementation of the procedures
initialize, broadcast (used in Fig. 4), delivery-process and failure-detector
changes depending on the particular failure model. However, we ensure that these
different implementations always satisfy a set of properties, called B1-B11
below. We extracted these properties in order to make our proofs modular. In
particular we proved that, independent of the failure model, the protocol in
Figs. 3 and 4 satisfies Pb1-Pb5, as long as the remaining procedures satisfy
B1-B11. As a result, we could then prove Pb1-Pb5 for any other failure model
by just ensuring that the implementation of broadcast, delivery-process and
failure-detector for that failure model satisfied B1-B11.


procedure primary(j)
   cobegin
   || inform-clients(“Dest = s_j”)
   || broadcast((mylastlog, last(state_j)), j)   /* to all sites */
      do forever
         when received request from client c
            response := Π(state_j, request)
            state_j := state_j ∘ response
            broadcast((log, response), j)
            send response to client c
      od
   coend

procedure backup(k)
   do forever
      ((tag, s_j, r), j) := Deq(Rqueue_k)
      /* assume that dequeueing an empty queue
         does not return any sensible value of tag */

      /* synchronizing with the new primary */
      if tag = mylastlog then
         if r ∈ state_k then
            if r = last(state_k) then skip
            else state_k := state_k \ last(state_k)
         else state_k := state_k ∘ r

      /* logging response from primary */
      if tag = log then state_k := state_k ∘ r

      /* becoming the primary */
      if ∀j < k : Faulty_k[s_j] then primary(k)
   od

Fig. 4. The procedures primary and backup


We now give the properties B1-B11. In these properties, d, C and τ are some
constants whose values depend on the failure model. Intuitively, d corresponds
to the amount of time that can elapse from the time a message is broadcast to
the time it is dequeued by the receiver, C corresponds to the blocking time and
τ corresponds to the interval between successive “I am alive” messages that sites
send to each other (as we will see in the implementation of failure-detector).


When we say that a site “halts”, we mean that either the site has crashed or
has stopped executing the protocol by executing a stop. The array of booleans
Faulty_k indicates which sites s_k believes to have halted: Faulty_k[s_j] being
true implies that s_k believes that s_j has halted. Finally, we define a broadcast
by a site to be successful if the site does not halt during the execution of
broadcast.

The properties can be subdivided according to the procedures to which they 
relate: 

Properties of broadcast and delivery-process:

B1: If s_j initiates a broadcast b' after broadcast b, then no site dequeues b'
before b.

B2: If s_j initiates a broadcast b at time t, then no site dequeues b after time
t + d.

B3: If s_j initiates a broadcast at time t and does not halt by time t + C, then
the broadcast is successful. Furthermore, no broadcast takes longer than C
to complete.

Properties of failure-detector:

B4: If Faulty_j[s_k] becomes true, then it continues to be true, unless s_j halts.

B5: The value of Faulty_j[s_k] can only change at time t = lτ + d for some
integer l ≥ 0.

B6: If Faulty_j[s_k] = true at time t then s_k has halted by time t.

B7: If s_j has not halted by time t_1, and s_i, i < j, has halted by time t_2
where t_1 = t_2 + τ + d, then Faulty_j[s_i] = true by time t_1.

Properties of broadcast and delivery-process interacting with
failure-detector:

B8: No correct site halts in procedures initialize, broadcast, delivery-process
or failure-detector.

B9: If s_j initiates a successful broadcast at time t, then for all non-halted
sites s_k, k > j, Faulty_k[s_j] = false through time ⌊t/τ⌋τ + d.

B10: If s_j initiates a successful broadcast b, then for every non-halted site s_k:
(Faulty_k[s_j] = true) ⇒ (s_k has dequeued b).

B11: If s_j initiates a broadcast b at time t and s_k, k > j, broadcasts b', then
either no site dequeues b after b', or Faulty_k[s_j] = false through time t + d.

7.1 Outline of the Proof of Correctness 

We now informally argue that the protocol in Figs. 3 and 4 satisfies Pb1-Pb5 as
long as the procedures initialize, broadcast, delivery-process and
failure-detector satisfy B1-B11.

Define: Prmy_{s_j} at time t ≡ s_j has not halted by time t
        ∧ ∀k < j : Faulty_j[s_k] = true at time t.
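
The definition of the primary predicate translates directly into code. In this
Python sketch, the function name `is_primary` and the encoding of the halted
flags and of the array Faulty_j as lists of booleans indexed by site number are
assumptions made for illustration.

```python
def is_primary(j, halted, faulty_j):
    """Prmy for site s_j: s_j has not halted and s_j believes that every
    lower-numbered site has halted."""
    return (not halted[j]) and all(faulty_j[k] for k in range(j))
```

Site s_0 satisfies the predicate whenever it has not halted, because the
universally quantified condition is vacuous for j = 0.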

From the above definition, Pb1 can now be seen from B6 and the backup
protocol in Fig. 4. Pb2 trivially follows from Fig. 2. Pb3 follows from Fig. 4 as
no request is sent to a site s_j before s_j becomes the primary. Also, Pb5 holds
(from B8 and Fig. 4) as a correct primary continues to be the primary. We now
show Pb4.

In order to show Pb4, we need to show two things: that the state of the new
primary is consistent with the state of the old primary, and that all outages are
bounded. We first show that the states are consistent.

Starting at the top of Fig. 4: when a site s_j becomes the primary, it first
informs the clients of its identity by calling inform-clients. For now, ignore
the broadcast of (mylastlog, s_j, -) by primary s_j.

Whenever s_j gets a request from a client, it computes the response, changes
state, broadcasts the log to the backups and sends the response back to the
client. It can be seen from Fig. 4 that if primary s_j sends a response r to the
client, then s_j must have executed a successful broadcast of (log, s_j, r). This
fact and properties B1, B2, B9 and B10 imply that (log, s_j, r) must also have been
dequeued by any backup s_k before s_k becomes the primary. Thus, the state of s_k
will continue to be consistent with the state of s_j if and only if the states
were consistent when s_j became the primary. We show this as follows.

Informally, the states of s_j and s_k could be inconsistent when s_j becomes
the primary for the following reason. Consider a scenario in which some primary
s_i crashes during the broadcast of (log, s_i, r) for some r. It is possible that
s_k received (log, s_i, r) and s_j did not. As a result, the states of s_j and s_k
now differ. It is for this reason that s_j broadcasts (mylastlog, s_j, r'), where
r' = last(state_j), on becoming the primary. On receiving this, s_k sees that
r' ≠ last(state_k) = r and removes r from its state. As a result, state_j and
state_k become equal. Similarly, s_k would add r to its state had s_j, and not
s_k, received (log, s_i, r).
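
The reconciliation step that a backup performs on dequeuing a mylastlog message
(Fig. 4) can be sketched in Python as follows. The function name
`sync_with_new_primary` is hypothetical, states are modelled as lists of
responses, and the sketch assumes that responses are distinguishable, so that
membership in the state is meaningful.

```python
def sync_with_new_primary(state_k, r):
    """Return the backup's state after processing (mylastlog, s_j, r),
    where r is the last response logged by the new primary."""
    if r in state_k:
        if state_k[-1] == r:
            return list(state_k)      # already consistent: skip
        return list(state_k[:-1])     # backup logged one extra response: drop it
    return list(state_k) + [r]        # backup missed the last response: add it
```

With state_k = [a, b, r] and a new primary whose last response is b, the backup
drops r; with state_k = [a, b] and a new primary whose last response is r, the
backup appends r.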

In the scenario described in the last paragraph, response r is never sent to
the client (i.e., there is a service outage). We now show that such outages are
bounded. s_i did not send the response, and so by B3, must have halted by time
t (say). Now from B7 either s_{i+1} halts or becomes the primary by time t + τ + δ.
Since no correct site halts (by B8 and Fig. 4), and the number of faulty sites is
bounded by f, there eventually will be a time when there is a correct primary
and no more outages occur.

From B3, the protocol is C-blocking. Furthermore, it can be shown from B7,
B8 and Fig. 4 that the failover time of the protocol is f(d + τ) for arbitrarily
small and positive τ.

However, the primary procedure in Fig. 4 does not work if there are message
losses between the clients and the sites (due to link or omission failures). For
example, a non-faulty primary might omit to receive all requests from a client
due to a failure, violating Pb4. Similarly, inform-clients might omit to inform
some of the clients. However, it is relatively easy to account for these failures
when clients are non-faulty. Assume that there is an upper bound (say G) between
any two requests from a client and that requests carry sequence numbers.
If the primary does not receive any requests from a client during an interval of
length G or if the primary receives some request with a sequence number gap,
then the primary halts. Similarly, the primary can detect that a response was
lost by having clients acknowledge responses. If such an acknowledgement is not
received, then again the primary halts. Properties Pb1-Pb5 can again be shown
to be true if we make the above modification in Figs. 2 and 4.
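
The halting rule for lost client requests can be illustrated with a short Python
sketch. Several details are assumptions made here and are not fixed by the text
above: the function name `primary_must_halt`, sequence numbers that start at 0
and increase by one per request, arrivals given as (time, sequence number) pairs
for a single client, and the simplification that variation in message delay is
ignored when comparing inter-arrival gaps with G.

```python
def primary_must_halt(arrivals, now, G, start=0):
    """Return True if the primary should halt for this client: either a
    sequence number is missing or no request arrived within an interval G."""
    expected_seq = 0
    last_time = start
    for time, seq in arrivals:
        if seq != expected_seq:       # sequence number gap: a request was lost
            return True
        if time - last_time > G:      # silence longer than G
            return True
        expected_seq += 1
        last_time = time
    return now - last_time > G        # silence since the last request
```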


8 Implementation for the Various Failure Models

In this section, we show how to implement B1-B11 for the various failure
models.


8.1 Crash Failures 

The procedures implementing B1-B11 for crash failures are given in Fig. 5.
Whenever we say that a site “delivered M”, we mean that the procedure deliver
has been called with M. Enq adds an element to the head of a queue and Deq
dequeues an element from the tail.


procedure initialize(k)
   state_k := Rqueue_k := ε
   ∀i : Faulty_k[s_i] := false

procedure broadcast(M, k)
   send M to all sites

procedure deliver(M, k)
   Let M be of the form (tag, -, -)
   if tag ∈ {log, mylastlog} then Enq(Rqueue_k, (M, k))

procedure delivery-process(k)
   do forever
      if received M then deliver(M, k)
   od

procedure failure-detector(k)
   cobegin
   || for i := 0 to ∞
         when current-time = iτ: send (alive, s_k, iτ) to all sites
   || for i := 0 to ∞
         when current-time = iτ + d:
            ∀j : if not delivered (alive, s_j, iτ) then Faulty_k[s_j] := true
   coend

Fig. 5. Procedures for crash failures



We now informally argue that B1-B11 hold for this implementation if d = δ
and C = 0. B1 holds as channels are FIFO, and B2 holds as d = δ and the
maximum message delivery time is also δ. B3, B4 and B5 can be seen trivially.
B6 and B7 can be seen from failure-detector as there are no message losses
and message delivery time is at most δ. B8 holds trivially. It can be shown that
if s_j halts at time t, then no site sets Faulty[s_j] to true before time t + δ. B9,
B10 and B11 now follow.
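
The timing behind B6 and B7 for the crash-failure detector of Fig. 5 can be
worked through numerically. The following Python sketch computes when a halted
site is first marked faulty. It is a simplified illustration: the function name
`detection_time` is hypothetical, and it assumes that a site halting at time t
sends no alive message scheduled at or after t.

```python
import math
from fractions import Fraction


def detection_time(halt_time, tau, d):
    """Time at which other sites set Faulty for a site that halted at
    halt_time, given alive messages every tau and checks at i*tau + d."""
    # Index of the first alive message that the halted site fails to send.
    i = math.ceil(Fraction(halt_time) / Fraction(tau))
    return i * tau + d
```

With τ = 5 and d = 2, a site that halts at time 7 misses the alive message
scheduled at time 10 and is marked faulty at time 12, which lies within the
bound t + τ + d = 14 of B7 and not before t + d = 9.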

The procedures in Fig. 5 require n_s > f, and so the lower bound on the
degree of replication is tight. Since C = 0 and d = δ, from Sect. 7.1, the lower
bounds on blocking time and failover time are tight as well.


8.2 Crash+Link Failures 

The procedures in Sect. 8.1 do not work if links can fail. For example, if s_j
sends a message to s_k then the message might not reach s_k due to a link failure
(which will violate B6 and B10). We therefore replace the implementation in
Fig. 5 with the one in Fig. 6, except that deliver is the same as before. For
this implementation, d = 2δ and C = 0. These procedures use fifo-broadcast
and fifo-deliver in Fig. 7, which ensure that intermittent link failures become
permanent failures: if s_j fifo-broadcasts a message m to s_k and s_k omits to
fifo-deliver m, then s_k will not fifo-deliver any subsequent message from s_j.


It can be shown (proof omitted) that this new implementation again satisfies
B1-B11 if n_s > f + 1. Informally, this is true for the following reason.
Whenever s_j initiates a broadcast of M at time t, it sends M to all sites, and the
sites then relay M to all other sites. Since n_s > f + 1, there is always at least
one non-faulty path between any two non-crashed sites, where a path consists of
zero or one intermediate sites. Therefore, if s_j does not crash during the
broadcast, then all non-crashed sites will deliver M by time t + 2δ. Furthermore,
B1 will be satisfied because of the FIFO properties of fifo-broadcast and
fifo-deliver.

This crash+link protocol requires n_s > f + 1, is 0-blocking (since C = 0),
and has a failover time of f(2δ + τ) (since d = 2δ). Thus, all lower bounds for
crash+link failures are tight.


8.3 General-Omission Failures 

The implementation of the procedures for general-omission failures is given in
Figs. 8 and 9, except delivery-process, which is the same as in Fig. 6. Whenever
we say that a site “fifo-delivered M”, we mean that the procedure fifo-deliver
was called with M. These procedures were developed using a technique similar
to [8] (although modified to work in our non-round-based model), which requires
n_s > 2f and d = 2δ.


procedure initialize(k)
   state_k := Rqueue_k := Dqueue_k := ε
   ∀i : Faulty_k[s_i] := false
   last-sent_k := ∀j : expected_k[j] := 0

procedure broadcast(M, k)
   time := current-time
   fifo-broadcast(init, M, s_k, time)

procedure delivery-process(k)
   cobegin
   || fifo-delivery-process(k)
   || do forever
         (tag, M, -, t) := Deq(Dqueue_k)
         if tag = init then fifo-broadcast(echo, M, s_k, t)
         if tag = echo and not dequeued (tag, M, -, t) before then deliver(M, k)
      od
   coend

procedure failure-detector(k)
   A_j^i = (alive, s_j, iτ)
   cobegin
   || for i := 0 to ∞
         when current-time = iτ: fifo-broadcast(init, A_k^i, s_k, iτ)
   || for i := 0 to ∞
         when current-time = iτ + d:
            ∀j : if not delivered A_j^i then Faulty_k[s_j] := true
   coend

Fig. 6. Procedures for crash+link failures


procedure fifo-broadcast(tag, M, s_k, t)
   send (tag, M, s_k, t, last-sent_k) to all
   last-sent_k := last-sent_k + 1

procedure fifo-deliver(tag, M, s_j, t)
   Enq(Dqueue_k, (tag, M, s_j, t))

procedure fifo-delivery-process(k)
   do forever
      if received (tag, M, s_j, t, last_j) then
         if (last_j ≠ expected_k[j]) then skip
         else
            expected_k[j] := expected_k[j] + 1
            fifo-deliver(tag, M, s_j, t)
   od

Fig. 7. Procedures for crash+link failures
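
The way the sequence numbers in Fig. 7 turn an intermittent loss into a
permanent one can be seen in a small Python sketch of the receiving side. The
class name `FifoReceiver` and the use of integer site indices are assumptions
made for illustration; the sketch covers only the check performed by
fifo-delivery-process.

```python
class FifoReceiver:
    """Receiving side of Fig. 7 at one site: accept a message from sender j
    only if it carries the next expected sequence number."""

    def __init__(self, n_sites):
        self.expected = [0] * n_sites   # expected_k[j]
        self.dqueue = []                # Dqueue_k

    def on_receive(self, tag, msg, sender, t, seq):
        if seq != self.expected[sender]:
            return False                # skip: an earlier message was missed
        self.expected[sender] += 1
        self.dqueue.append((tag, msg, sender, t))   # fifo-deliver
        return True
```

If the message with sequence number 1 from some sender is lost, the messages
numbered 2, 3 and so on from that sender are all skipped, because the expected
counter never advances past 1.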




procedure initialize(k)
   state_k := Rqueue_k := Dqueue_k := ε
   ∀i : Faulty_k[s_i] := false
   current-primary := last-sent_k := ∀j : expected_k[j] := 0

procedure broadcast(M, k)
   time := current-time
   fifo-broadcast(init, M, s_k, time)
   if by time + d fifo-delivered (echo, M, s_j, time)
         for at least n_s - f different j then return
   else stop

procedure deliver(M, k)
   Let M be of the form (tag, s_j, -)
   if tag ∈ {log, mylastlog} then
      if j < current-primary then return
      else
         current-primary := j
         Enq(Rqueue_k, (M, k))

Fig. 8. Procedures for general-omission failures


We now briefly argue that these procedures satisfy B1-B11. The detailed
proof is omitted from this paper. Had we used the implementation of broadcast
in Fig. 6, B10 (in particular) would be violated because a faulty primary s_j
might omit to send the logs to the backups. Therefore, in Fig. 8, s_j stops in
the broadcast of a response (say r) if fewer than n_s - f sites fifo-deliver and
subsequently fifo-broadcast r. However, even if s_j does not stop in the broadcast,
a faulty (but non-crashed) site s_k might still omit to deliver r, due to a
receive-omission failure, and later become the primary were s_j to fail. To
prevent this, s_k ensures (in procedure failure-detector) that it fifo-delivers
some message (say m') from at least one of the above n_s - f sites that had
earlier fifo-broadcast r. If s_k does not receive such an m', then s_k stops.
Now, if s_k omitted to fifo-deliver r, then by the properties of fifo-broadcast
and fifo-deliver, s_k cannot fifo-deliver m' and would stop (and, therefore,
cannot become the primary). Property B6 is similarly satisfied by ensuring that
sites detect their own failure to send or receive alive messages and therefore
stop.
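
Two decisions from Fig. 8 can be isolated as pure functions: whether a
broadcaster returns or stops, and whether deliver enqueues a message. The Python
sketch below is illustrative only. The function names, the representation of a
message as a (tag, j, payload) triple, and the convention of returning the new
value of current-primary are assumptions introduced here; timing and transport
are not modelled.

```python
def broadcast_outcome(echo_senders, n_s, f):
    """'return' if echoes arrived from at least n_s - f different sites
    by the deadline, otherwise 'stop' (the broadcaster halts itself)."""
    return "return" if len(set(echo_senders)) >= n_s - f else "stop"


def deliver(current_primary, rqueue, message):
    """Enqueue a log or mylastlog message unless it comes from a site with a
    lower index than the current primary; return the updated current-primary."""
    tag, j, _payload = message
    if tag not in ("log", "mylastlog") or j < current_primary:
        return current_primary
    rqueue.append(message)
    return j
```

With n_s = 5 and f = 2, echoes from three distinct sites let the broadcast
return, while echoes from only two make the broadcaster stop.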

These procedures require n_s > 2f, d = 2δ and C = 2δ. Furthermore, we have
developed a protocol for f = 1 (omitted in this paper) that is δ-blocking. Thus,
we establish that all lower bounds for general-omission failures are tight.

As mentioned earlier, the δ-blocking protocol for f = 1 has scenarios in which
the site that receives the request is not the site that responds to the clients. This
is in fact necessary. Define a protocol to be “pass the buck” if in any failure-free
run of the protocol, the site that receives a request is not the site that sends the
corresponding response.



procedure failure-detector(k)
   ∀i, j : A_j^i := (alive, s_j, iτ)
   ∀i, j : F_j^i := (fault, s_j, iτ)
   cobegin
   || for i := 0 to ∞
         when current-time = iτ: fifo-broadcast(init, A_k^i, s_k, iτ)
   || for i := 0 to ∞
         when current-time = iτ + δ:
            ∀j : if not fifo-delivered (init, A_j^i, s_j, iτ) then
                    fifo-broadcast(echo, F_j^i, s_k, iτ)
   || for i := 0 to ∞
         when current-time = iτ + d:
            witness_k[k] := {s_l | fifo-delivered (echo, A_k^i, s_l, iτ)}
            ∀j ≠ k : witness_k[j] := {s_l | fifo-delivered (echo, A_j^i, s_l, iτ) or
                                            fifo-delivered (echo, F_j^i, s_l, iτ)}
            if ∃j : |witness_k[j]| < n_s - f then stop
            if ∃j : not delivered A_j^i then Faulty_k[s_j] := true
   coend

Fig. 9. Procedures for general-omission failures


Theorem 1. Any C-blocking protocol, where C < 2δ, for send-omission failures
is “pass the buck”.

Proof. Omitted in this paper. See [4]. □

8.4 Other Failure Models 

The implementations of the procedures for send-omission and receive-omission
failures are similar to those for general-omission failures and so are omitted
from this paper. For receive-omission failures, the lower bound on the degree
of replication and the lower bound on blocking time when n_s ≤ 2f and f > 1
are not tight. Finding optimal protocols remains an open problem. However, the
lower bound on failover time for receive-omission failures, and all lower bounds
for send-omission failures are tight.

9 A Surprising Protocol 

We now describe a δ-blocking protocol tolerating receive-omission failures for
the special case of n_s = 2 and f = 1. This protocol is complex, and so we omit
the detailed description and only outline the protocol's operation here. This
protocol shows that our lower bound on blocking time when n_s ≤ 2f and f = 1
is tight. The protocol has the odd (yet necessary, as shown in [5]) property that
a non-faulty primary is forced to relinquish control to a faulty backup.
Furthermore, the protocol is “pass the buck”. We, however, show that most
δ-blocking protocols tolerating receive-omission failures have to be “pass the
buck”.



Informally, let Γ be the maximum time between any two successive client
requests (possibly from different clients), and let D be such that if some site s
becomes the primary at time t_0 and remains the primary through time t > t_0 + D
when a client i sends a request, then Dest_i = s at time t. We write D < Γ to
mean that D is bounded and Γ is either unbounded or bounded and greater
than D. Then

Theorem 2. Any C-blocking protocol, where C < 2δ, for receive-omission
failures with n_s ≤ 2f and D < Γ is “pass the buck”.

Proof. Omitted from this paper. □

Whether a protocol has to be “pass the buck” when the relation D < Γ does
not hold is an open question.

We now describe the protocol. There are two sites s_0 and s_1. They communicate
with each other using fifo-broadcast and fifo-deliver shown in Fig. 7.
Henceforth, when we say that a site sends a message to the other, we will mean
that the message is sent with fifo-broadcast and the other site receives it with
fifo-deliver.

In a failure-free run of this protocol, since the backup responds to the client, 
the primary forwards any response to the backup (with a green tag as we see 
below) and the backup sends this response to the client. However, if there is 
a failure, then the primary responds to the clients. In this case, the primary 
forwards a response to the backup with a red tag. The backup does not forward 
a response to the client if the response has a red tag. 

Let s_0 initially be the primary. Whenever s_0 receives a request from the 
client, it computes a response r, changes state, and sends (green, r) to s_1. Upon 
receiving this message, s_1 updates its state, acknowledges to s_0, and then sends 
r to the client. Because it is the backup that responds to the client, the protocol 
is δ-blocking. Site s_0 processes a new request only after receiving the 
acknowledgement from s_1 for the previous request. Finally, s_0 periodically sends alive 
messages to s_1, and s_1 acknowledges these messages. 
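
The green and red tags can be illustrated with a minimal sketch. It makes several simplifying assumptions that are not part of the protocol description: messages are modelled as direct method calls, the service state is a log of responses, the computation `done:<request>` is a placeholder, and the flag `failure_suspected` stands in for the timeout logic described next.

```python
GREEN, RED = "green", "red"

class Backup:
    def __init__(self):
        self.log = []                      # stand-in for the replicated state

    def on_message(self, tag, r):
        self.log.append(r)                 # update state from the forwarded response
        reply_to_client = r if tag == GREEN else None   # red: do not forward
        return "ack", reply_to_client

class Primary:
    def __init__(self, backup):
        self.backup = backup
        self.log = []
        self.failure_suspected = False     # hypothetical stand-in for timeouts

    def on_request(self, request):
        r = f"done:{request}"              # placeholder service computation
        self.log.append(r)
        tag = RED if self.failure_suspected else GREEN
        _ack, backup_reply = self.backup.on_message(tag, r)
        if tag == GREEN:
            return "backup", backup_reply  # the backup answers the client
        return "primary", r                # the primary answers the client itself
```

In the green case the only message on the path between request and response is the one from the primary to the backup, which is why the response leaves after a single message delay.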

Suppose that s_0 does not get s_1's acknowledgement for some message, say, 
(green, r) (the argument is similar if no acknowledgement is received for an 
alive message). There are three possibilities: (1) s_1 has crashed, (2) s_1 omitted 
to receive (green, r) and so did not send the acknowledgement, (3) s_0 omitted to 
receive the acknowledgement. s_0 now waits until it is supposed to send the next 
alive message. s_0 sends this alive message and waits for an acknowledgement. 
We now consider the above three cases separately. 

Case 1: s_1 has crashed. As a result, s_0 does not receive the acknowledgement 
to the alive message. s_0 continues as the primary. From then on, whenever s_0 
receives a request from the client, it computes the response r, sends (red, r) to 
s_1, and then sends the response back to the client. Also, s_0 continues to send 
alive messages. Since s_0 is correct, it can continue like this forever. 

Case 2: s_1 is faulty and omitted to receive (green, r). By the property of 
fifo-broadcast and fifo-deliver, s_1 will not receive the alive messages that s_0 sends. 
s_1 concludes that s_0 has crashed, sends (“s_1 is primary”) to s_0 and becomes the 
primary. After that, it behaves like s_0 in case 1 above (including sending alive 
messages to s_0). Since s_0 is correct, it receives (“s_1 is primary”) (as opposed to 
case 1) and so it becomes the backup. Also, since s_0 is correct it will not omit 
to receive (red, r) messages that s_1 sends and so s_0 keeps its state consistent 
with s_1. Subsequently, if s_0 stops receiving alive messages from s_1, then s_1 has 
crashed and s_0 becomes the primary once again. 

Case 3: s_0 is faulty. Since s_1 is correct, it receives the alive message from s_0, 
sends the corresponding acknowledgement and remains the backup (as opposed 
to case 2). However, by the property of fifo-broadcast and fifo-deliver, s_0 will 
not receive this acknowledgement to the alive message (or the (“s_1 is primary”) 
message), and so it behaves as in case 1 and continues as the primary. Similar to 
case 2, s_1 receives all (red, r) messages that s_0 sends and so its state is consistent 
with s_0. Finally, s_1 becomes the primary if it stops receiving alive messages 
from s_0. 

Case 2 in the protocol is the odd scenario in which the correct primary s_0 
is being forced to relinquish to s_1, known to be faulty. However, this scenario is 
not something peculiar to our protocol. We showed in [5] that relinquishing to a 
faulty backup is necessary when n_s ≤ 2f. 
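
The three cases can be summarized as a small decision procedure that starts at the point where s_0 has missed an acknowledgement and sends its next alive message. This is a sketch under stated assumptions: the names `Scenario` and `outcome` and the role labels are hypothetical, timeouts are abstracted into booleans, and a receive-omission fault is taken to block every later message to the faulty site, as the FIFO property implies.

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    s1_crashed: bool = False        # case 1
    s1_omits_receive: bool = False  # case 2: s_1 faulty
    s0_omits_receive: bool = False  # case 3: s_0 faulty

def outcome(sc: Scenario):
    """Return the roles (s_0, s_1) settle into after s_0's next alive message."""
    # f = 1: at most one of the two sites is faulty.
    assert sum([sc.s1_crashed, sc.s1_omits_receive, sc.s0_omits_receive]) <= 1

    s1_gets_alive = not sc.s1_crashed and not sc.s1_omits_receive
    s1_sends_ack = s1_gets_alive
    # A running s_1 that hears no alive message concludes s_0 has crashed.
    s1_takes_over = not sc.s1_crashed and not s1_gets_alive

    s0_gets_ack = s1_sends_ack and not sc.s0_omits_receive
    s0_gets_takeover = s1_takes_over and not sc.s0_omits_receive

    if s0_gets_takeover:
        s0_role = "backup"              # correct primary relinquishes
    elif s0_gets_ack:
        s0_role = "primary (green)"
    else:
        s0_role = "primary (red)"       # answers clients itself

    if sc.s1_crashed:
        s1_role = "crashed"
    elif s1_takes_over:
        s1_role = "primary (red)"
    else:
        s1_role = "backup"
    return s0_role, s1_role
```

In every scenario with a single fault, exactly one running site ends up as primary. In case 2 that site is the faulty s_1, which is the odd behaviour discussed above.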


10 Discussion 

In [5], we present lower bounds for primary-backup protocols which constrain 
the degree of replication, the failover time, and the amount of time it can take 
to respond to a client request. In this paper, we derive matching protocols and 
show that all except two of these lower bounds are tight. Furthermore, we show 
that in some cases the optimal response time can only be obtained if the site 
that receives the request is not the site that sends the response to the clients. 

We have attempted to give a characterization of primary-backup that is broad 
enough to include most synchronous protocols that are considered to be instances 
of the approach. There are protocols, however, that are incomparable to the class 
of protocols we analyze as these protocols were developed for an asynchronous 
system [6,9]. We are currently studying possible characterizations for a primary- 
backup protocol in an asynchronous system and expect to extend our results to 
this setting. 